Abstract

Background: The identification of protein-protein interaction sites is a computationally challenging task and
important for understanding the biology of protein complexes. There is a rich literature in this field. A broad class of
approaches assigns to each candidate residue a real-valued score that measures how likely it is that the residue
belongs to the interface. The prediction is obtained by thresholding this score.

Some probabilistic models classify the residues on the basis of the posterior probabilities. In this paper, we introduce
pairwise conditional random fields (pCRFs) in which edges are not restricted to the backbone as in the case of
linear-chain CRFs utilized by Li et al. (2007). In fact, any 3D-neighborhood relation can be modeled. On the basis of a
generalized Viterbi inference algorithm and a piecewise training process for pCRFs, we demonstrate how to utilize
pCRFs to enhance a given residue-wise score-based protein-protein interface predictor on the surface of the protein
under study. The features of the pCRF are based solely on the interface prediction scores of the predictor whose
performance is to be improved.

Results: We performed three sets of experiments with synthetic scores assigned to the surface residues of proteins
taken from the data set PlaneDimers compiled by Zellner et al. (2011), from the list published by Keskin et al. (2004) and
from the very recent data set due to Cukuroglu et al. (2014). That way we demonstrated that our pCRF-based
enhancer is effective given that the interface residue score distribution and the non-interface residue score distribution are unimodal.
Moreover, the pCRF-based enhancer is also successfully applicable if the distributions are only unimodal over a
certain sub-domain. The improvement is then restricted to that domain. Thus we were able to improve the prediction
of the PresCont server devised by Zellner et al. (2011) on PlaneDimers.

Conclusions: Our results strongly suggest that pCRFs form a methodological framework to improve residue-wise 
score-based protein-protein interface predictors given the scores are appropriately distributed. A prototypical 
implementation of our method is accessible at http://ppicrf.informatik.uni-goettingen.de/index.html. 



Background 

Protein-protein interactions are constitutive of almost 
every biological process. The ability to identify the 
residues that form the interaction sites of these complexes 
is necessary to understand them. In particular, it is the 
basis for new therapeutic approaches to treat diseases 
[1,2]. 

A great deal of work has been done on developing in-silico
prediction methods. As already observed by Zhou
et al. [3], these methods can be subdivided with respect
to the kind of mathematical foundation invoked and with
respect to the features or characteristics of the protein
used.

Residue-wise score-based prediction methods 

Let $x_r$ be the data relevant for a residue $r$ in a given protein
chain. These methods then employ a function $f(x_r, \lambda)$,
where $\lambda$ are some coefficients which have been learned
through the training. The value of $f(x_r, \lambda)$ then determines
whether $r$ is rated as an interface residue or not. The linear
regression method [4,5], the scoring function method
[6-11], the neural network method [12-17], and the support
vector machine method [18-25] are of this kind.

Probabilistic methods

Let X be the data relevant for a protein chain, where 
these data are assumed to stem from a random source 
thus obeying a random distribution. X, which alternatively 
is called the observation, typically includes the structure. 
The label sequence of the residues Y that classifies each 
individual residue either as interface or as non-interface is 
assumed to be random, too. Typically, probabilistic methods
use the conditional probability distribution P (Y | X)
to determine a classification y* of the residues of maximal
posterior probability P (y* | x). Naive Bayesian methods
[26], Bayesian network methods [27], hidden Markov
models (HMMs) [26], and linear-chain Conditional Ran- 
dom Fields taking the backbone as underlying graphical 
structure [28] fall in this category. Using posterior decod- 
ing on the basis of the forward-backward algorithm, both 
HMMs and CRFs are residue-wise score-based prediction 
methods, where the binary decision is made by threshold- 
ing the posterior probabilities of classifying the residues 
as interface. 

Notations 

We use Latin uppercase letters when referring to random 
aspects of the objects denoted by them. In contrast, low- 
ercase letters denote arbitrarily chosen but fixed objects. 
In this context boldface letters indicate vectors, the corre- 
sponding non-boldface letters their coefficients. 

The vast majority of methods use the 3D structure 
of the target protein chain in form of a PDB file as 
input [4-13,15,17-21,23-25]. However, a few methods do
not require a 3D structure and use sequences
only [14,16,22]. We here consider the problem with a
given 3D structure of the target protein chain. Sequence- 
based input may include a multiple sequence alignment 
of related proteins from which, for example, sequence 
conservation can be inferred. When the 3D structure of 
an unbound binding partner is also available, protein- 
protein docking methods can be applied. This has also 
been exploited to provide feedback from docking to the 
more specific problem of interface prediction [29]. We 
here consider the case where the binding partner's 3D
structure is not given. Nor does the presented method require
the sequence of the binding partner. However, we tested on
homodimers only, as we focus here on our new
method rather than on features or types of proteins. The
protein features used for interface prediction in the litera- 
ture are reviewed in the Methods section as far as we make 
use of them in this article. 

Most of the current studies for predicting interac- 
tion sites of proteins that use a probabilistic method 
are restricted by treating the residues of the proteins 
as independent vertices. Li et al. have taken the backbone
neighborhood into account, thus modeling the protein
as a sequence [28] using what can be called a
line CRF or linear-chain CRF. The features they define
on the label pair of two backbone neighbors have the
effect of smoothing the predicted labels along the protein
sequence. What is decisive, however, is that they were the first
to use conditional random fields (CRFs) for interface
prediction. CRFs in turn have come into use for solving
sequence labeling problems through the work of Lafferty et al. [30]. See
[31] for an overview. From the mathematical point of view
they take advantage of the fact that they model the condi- 
tional probability P (Y | X) rather than the joint probability 
P (Y, X). Recently there has been an explosion of interest
in conditional random fields (CRFs) with successful applications.
CRFs have been shown to be capable of
solving sequence labeling problems such as part-of-speech
tagging (POST) [32] and natural language processing [33].
Furthermore in the web extraction problem, in which the 
web-sites are modeled as two dimensional grid graphs, 
CRFs perform well [34]. One of their outstanding benefits 
over many other statistical models is that a CRF can easily 
describe the dependencies of observations. 

As proteins are folded into three dimensional struc- 
tures, spatial relationships create dependencies between 
residues. For example, we find on the test data described
below that the correlation coefficient between spatial
neighbors that are not also sequence neighbors
(distance < 3.5 Å) is 0.45. This is only slightly lower than the
correlation coefficient between residues that are sequence
neighbors (0.49). As there are more than three times as
many spatial pairs of neighbors as sequence neighbors
at this threshold, it is reasonable from a modeling
standpoint to use a model that respects all dependencies 
induced by spatial proximity, not only the dependencies 
induced by proximity along the backbone. 

There are many papers using spatial neighborhood
information of residues to predict protein interaction
sites (see e.g. [2,13,21,28]). However, the spatial information
of proteins was only integrated into the feature
functions, but not represented in the model. For probabilistic
models, the difference between the two ways to
integrate spatial information is that in previous models
the label $Y_i$ of the $i$-th residue is conditionally independent
of the labels of other residues given the data $\mathbf{X}$ and,
in the case of linear CRFs or HMMs, given the labels
$Y_{i-1}$ and $Y_{i+1}$. Even when neighborhood information is
only used for spatial smoothing of the labels, the intuitive
advantage over, say, an SVM classifier that uses spatial
neighborhood in the features but classifies each residue
independently, is that candidate labelings that are not patch-like
are explicitly penalized. In contrast, such an independent
classifier approach may have a tendency to predict individual
interface residues 'sprinkled' around the protein
surface [28].

For this reason, a general CRF seems to be more suitable
for the task. However, inference for general CRFs
is intractable. In this paper, pairwise conditional random
fields (pCRFs) are utilized. Specializing general CRFs, only
node cliques and edge cliques are taken into consideration
in pCRFs. A pCRF retains most spatial information of proteins,
can be specified with the same number of parameters
as a line CRF, and approximate inference remains feasible
with the generalization of the Viterbi algorithm introduced
here. Following the example of piecewise training methods
[35], we disentangled the labels of nodes and edges to
train the model.

In order to take advantage of a residue-wise score- 
based predictor, we model the protein surface by means 
of a pCRF, where the observation is solely a sequence of 
surface residue scores between 0 and 1 output by the pre- 
dictor. We then utilize a generalized Viterbi algorithm and 
piecewise training. The resulting tool tries to enhance the 
predictor chosen on the surface of the protein under study. 
It is the aim of this paper to demonstrate effectiveness of 
this approach provided that the interface residue scores 
and the non-interface residue scores are appropriately 
distributed. 

Methods 

We address the problem of improving residue-wise score- 
based predictors for protein interface residues as a node 
labeling problem for undirected graphs using the model 
class of conditional random fields (CRFs). Lafferty et al.
[30] were the first to apply CRFs to the problem
of labeling sequence data. Li et al. [28] used line CRFs
to address interaction site prediction. Line CRFs have the
advantage that the Viterbi algorithm, well known from
decoding HMMs, can be used to efficiently infer the most
likely labeling sequence. Very useful and illustrative presentations
on CRFs are given in [31,32,36,37]. The above CRF-based
models make the assumption that the label of one
residue is conditionally independent of the labels of all
other residues given the labels of the two adjacent residues
in the protein sequence. To the best of our knowledge, we
are the first to employ a graphical model that takes the 
spatial neighborhood of residues located on the protein 
surface into account. 

This section is subdivided into three parts. We first 
explain how we model protein surfaces by pairwise CRFs. 
Then we introduce our new inference method. Finally, we 
elucidate our training method. 

Using conditional random fields to model protein surfaces 

For every protein under study that has n surface residues, 
a pair of random vectors (X, Y) is considered. The vector 
X is the observation that represents the knowledge about 
this protein that is utilized in the prediction, e.g. the 3D 
structure of the target protein and a multiple sequence 
alignment together with homologs. 



The vector $\mathbf{Y}$ is a random sequence of length $n$ over
the alphabet $\{\mathrm{I}, \mathrm{N}\}$ that labels the index set $\{1, 2, \ldots, n\}$,
which in turn is called the set of positions (of the surface
residues). The label I represents interface residues,
whereas the label N represents non-interface residues.
$\{\mathrm{I}, \mathrm{N}\}^n$ is the set of all label sequences of length $n$ over
$\{\mathrm{I}, \mathrm{N}\}$. We will also call them assignments as the term 'label
sequence' may lead to confusion when applied below to
subsets of $\{1, 2, \ldots, n\}$ that are not contiguous sequences.

Let $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ be the neighborhood graph, where
$\mathcal{V} = \{1, 2, \ldots, n\}$ is the set of positions and $\mathcal{E}$ is the set of edges that
typically results from an atom-distance-based neighborhood
definition for positions. We assume for convenience
in notation that $\mathcal{G}$ has no isolated nodes. Cases with isolated
nodes could trivially be reduced to cases without
isolated nodes. Let $\mathcal{C}$ be the set of cliques of $\mathcal{G}$, which we
refer to as node cliques. For a node clique $c \in \mathcal{C}$ and an
assignment $\mathbf{y}$ we denote by $\mathbf{y}_c$ the restriction of $\mathbf{y}$ to the
positions belonging to the node clique $c$. For $c = \{i\}$ and
$c = \{i, j\}$ we write $y_i$ and $\mathbf{y}_{i,j}$ rather than $\mathbf{y}_{\{i\}}$ and $\mathbf{y}_{\{i,j\}}$.

The preceding notation is also used in the slightly more
general case of partial label assignments to arbitrarily chosen
subsets $S$ of the set of positions $\mathcal{V}$. Formally, let $\mathbf{y}_S$
denote $\{(i, y_i) \mid i \in S,\ y_i \in \{\mathrm{I}, \mathrm{N}\}\}$. Given two partial
assignments $\mathbf{y}_{S_1}$ and $\mathbf{y}_{S_2}$ that are identical on $S_1 \cap S_2$, the union
$\mathbf{y}_{S_1} \cup \mathbf{y}_{S_2}$ is well-defined.

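The conditional distribution function of our pCRF $(\mathbf{X}, \mathbf{Y})$
with respect to the neighborhood graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ is
defined as follows:

$$P(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \exp\Big( \sum_{i \in \mathcal{V}} \Phi_i(y_i, \mathbf{x}) + \sum_{\{i,j\} \in \mathcal{E}} \Phi_{i,j}(y_i, y_j, \mathbf{x}) + \sum_{c \in \mathcal{C} \setminus (\mathcal{V} \cup \mathcal{E})} \Phi_c(\mathbf{y}_c, \mathbf{x}) \Big), \tag{1}$$

where $\mathbf{x}$ and $\mathbf{y}$ are arbitrarily chosen instances of the
random observation $\mathbf{X}$ and the random label sequence
$\mathbf{Y}$, respectively, $\Phi_c(\mathbf{y}_c, \mathbf{x}) \in \mathbb{R}$ $(c \in \mathcal{C})$ is the feature
of the CRF located at the node clique $c$ (again $\Phi_i$ and
$\Phi_{i,j}$ simplify notation for $\Phi_{\{i\}}$ and $\Phi_{\{i,j\}}$), and $Z(\mathbf{x})$ is the
observation-specific normalization factor defined by

$$Z(\mathbf{x}) := \sum_{\mathbf{y} \in \{\mathrm{I}, \mathrm{N}\}^n} \exp\Big( \sum_{i \in \mathcal{V}} \Phi_i(y_i, \mathbf{x}) + \sum_{\{i,j\} \in \mathcal{E}} \Phi_{i,j}(y_i, y_j, \mathbf{x}) + \sum_{c \in \mathcal{C} \setminus (\mathcal{V} \cup \mathcal{E})} \Phi_c(\mathbf{y}_c, \mathbf{x}) \Big).$$

Let us call $\ln(Z(\mathbf{x}) P(\mathbf{y} \mid \mathbf{x}))$ the score of the label sequence
$\mathbf{y}$ given the observation $\mathbf{x}$.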

A CRF is called a pairwise CRF (pCRF) if $\Phi_c = 0$
for all node cliques $c$ larger than two. The remaining
features $\Phi_i$ and $\Phi_{i,j}$ are referred to as node features
and edge features, respectively. Thus, every position $i \in \mathcal{V}$
and every edge $\{i, j\} \in \mathcal{E}$ is represented by the
pair $(\Phi_i(\mathrm{N}, \mathbf{x}), \Phi_i(\mathrm{I}, \mathbf{x}))$ and by the quadruplet
$(\Phi_{i,j}(\mathrm{N}, \mathrm{N}, \mathbf{x}), \Phi_{i,j}(\mathrm{I}, \mathrm{N}, \mathbf{x}), \Phi_{i,j}(\mathrm{N}, \mathrm{I}, \mathbf{x}), \Phi_{i,j}(\mathrm{I}, \mathrm{I}, \mathbf{x}))$, respectively.
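To make the definition concrete, the following minimal sketch evaluates the score and the conditional probability of a pairwise CRF on a toy graph by explicit summation over all labelings. The graph, the feature values and the function names (`score`, `posterior`) are illustrative assumptions and are not taken from the authors' implementation; the brute-force normalization is only feasible for very small $n$.

```python
import itertools
import math

LABELS = ("I", "N")  # interface / non-interface


def score(y, edges, node_feat, edge_feat):
    """Score ln(Z(x) P(y | x)) of a full labeling y (dict: position -> label)."""
    total = sum(node_feat[i][y[i]] for i in y)
    total += sum(edge_feat[(i, j)][(y[i], y[j])] for (i, j) in edges)
    return total


def posterior(y, edges, node_feat, edge_feat):
    """P(y | x), with Z(x) computed by summing over all 2^n labelings."""
    positions = sorted(y)
    z = 0.0
    for labels in itertools.product(LABELS, repeat=len(positions)):
        candidate = dict(zip(positions, labels))
        z += math.exp(score(candidate, edges, node_feat, edge_feat))
    return math.exp(score(y, edges, node_feat, edge_feat)) / z


# Toy triangle graph with made-up node and edge feature values.
edges = [(1, 2), (2, 3), (1, 3)]
node_feat = {
    1: {"I": 0.8, "N": 0.1},
    2: {"I": 0.2, "N": 0.3},
    3: {"I": -0.5, "N": 0.6},
}
# The same edge feature on every edge: equal labels are rewarded.
smooth = {("I", "I"): 0.5, ("N", "N"): 0.5, ("I", "N"): -0.5, ("N", "I"): -0.5}
edge_feat = {e: smooth for e in edges}

p_all_interface = posterior({1: "I", 2: "I", 3: "I"}, edges, node_feat, edge_feat)
```

Because only node and edge terms enter the score, the pair and quadruplet representation described above is all that has to be stored per position and per edge.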

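Following [30], we assume moreover that each node feature
and each edge feature is a sum of weighted base
features. More precisely, for every position $i \in \mathcal{V}$ and
every edge $\{i, j\} \in \mathcal{E}$ we assume representations

$$\Phi_i(y_i, \mathbf{x}) = \sum_{k=1}^{K_1} \alpha_k \, \varphi_k(i, y_i, \mathbf{x}),$$

$$\Phi_{i,j}(y_i, y_j, \mathbf{x}) = \sum_{k=1}^{K_2} \beta_k \, \psi_k(i, j, y_i, y_j, \mathbf{x}),$$

where $\varphi_k$ and $\psi_k$ denote the node and edge base features, $\mathbf{y} \in \{\mathrm{I}, \mathrm{N}\}^n$ and $\mathbf{x}$ is an observation. The two real
vectors

$$\boldsymbol{\alpha} := (\alpha_1, \alpha_2, \ldots, \alpha_{K_1}) \quad \text{and} \quad \boldsymbol{\beta} := (\beta_1, \beta_2, \ldots, \beta_{K_2}) \tag{3}$$

need to be calculated in a training phase.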

In the most general sense, protein characteristics are 
real-valued evaluations of positions and pairs of adjacent 
positions (edges of the neighborhood graph), respectively, 
that are correlated with our position labeling problem. 
We use a standard step function technique to obtain base 
features from protein characteristics, rather than taking 
the raw values of the characteristics. To make our paper
self-contained, let us briefly describe this technique.

A protein characteristic depends on the observation 
and either a node or an edge. Each protein character- 
istic, such as e.g. the relative solvent-accessible surface 
area of a residue, is transformed into several binary fea- 
tures by binning, i.e. we distinguish only a few different 
cases rather than the whole range of the characteristic. 
Assuming the common case of real-valued characteris- 
tics, the bins are a partition of the reals into intervals. 
This discretization makes it possible to approximate any
shape of dependency of the labels on the characteris- 
tics, rather than assuming a fixed shape such as linear or 
logarithmic. 

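From protein characteristics for positions to node features.
We subdivide the range of the characteristic $C$ into,
say, $\gamma$ intervals, where $\gamma$ is at least two. Let $s_1 < s_2 <
\ldots < s_{\gamma-1}$ be the corresponding interval boundaries. It
is reasonable to take $s_\iota$ as the $\iota/\gamma$-quantile of the empirical
distribution of $C$ for non-interface residues, where
$C(i, \mathbf{x}) \in (s_0, s_\gamma]$. Then we define for each position $i \in \mathcal{V}$
the following $2\gamma$ base features associated with the position
characteristic $C$:

$$\varphi^{(C)}_{y,\iota}(i, y_i, \mathbf{x}) := \begin{cases} 1 & \text{if } y_i = y \text{ and } C(i, \mathbf{x}) \in (s_\iota, s_{\iota+1}]; \\ 0 & \text{otherwise;} \end{cases} \tag{4}$$

where $y = \mathrm{N}, \mathrm{I}$, and $\iota = 0, 1, \ldots, \gamma - 1$ and $s_0 := -\infty$, $s_\gamma := \infty$.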

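From protein characteristics for edges to edge features.
Let $D$ be the characteristic. Analogous to the previous
case, we then obtain for each edge $\{i, j\} \in \mathcal{E}$ the following
$4\gamma$ base features associated with $D$, where $y, y' \in \{\mathrm{N}, \mathrm{I}\}$
and $\iota = 0, 1, \ldots, \gamma - 1$:

$$\psi^{(D)}_{y,y',\iota}(i, j, y_i, y_j, \mathbf{x}) := \begin{cases} 1 & \text{if } y_i = y,\ y_j = y', \text{ and } D(i, j, \mathbf{x}) \in (s_\iota, s_{\iota+1}]; \\ 0 & \text{otherwise.} \end{cases} \tag{5}$$

In both cases we set $\gamma = 5$.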
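The binning step for node features can be sketched as follows. This is a minimal illustration under stated assumptions: the helper names (`quantile_boundaries`, `node_base_features`) and the sample values are hypothetical, and `statistics.quantiles` with its default method is used as one possible way to obtain the empirical quantiles of the non-interface residues.

```python
import bisect
import statistics


def quantile_boundaries(non_interface_values, gamma=5):
    """Interval boundaries s_1 < ... < s_{gamma-1} as empirical quantiles."""
    return statistics.quantiles(non_interface_values, n=gamma)


def node_base_features(value, label, boundaries):
    """Return the 2*gamma binary base features of one position.

    Keys are (y, iota); exactly one feature equals 1, namely the one whose
    label y matches `label` and whose interval (s_iota, s_{iota+1}] holds
    `value` (with s_0 = -inf and s_gamma = +inf).
    """
    gamma = len(boundaries) + 1
    # bisect_left yields the number of boundaries strictly below `value`,
    # which is the index iota of the half-open interval (s_iota, s_{iota+1}].
    iota = bisect.bisect_left(boundaries, value)
    return {
        (y, k): int(y == label and k == iota)
        for y in ("N", "I")
        for k in range(gamma)
    }


# Made-up characteristic values of non-interface residues.
sample = [0.05, 0.10, 0.15, 0.22, 0.30, 0.41, 0.55, 0.63, 0.78, 0.90]
cuts = quantile_boundaries(sample, gamma=5)
features = node_base_features(0.4, "I", [0.2, 0.4, 0.6, 0.8])
```

Edge base features follow the same pattern, with a pair of labels in place of a single label and therefore $4\gamma$ entries per characteristic.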

Devising a generalized Viterbi algorithm for pCRFs 

The problem of finding a most probable label sequence 
y* given an observation x is NP-hard for general pCRFs 
[31]. In this subsection we present a heuristic that approx- 
imately solves this problem. 

To this end, we first devise an algorithm, which we
call the generalized Viterbi algorithm. It computes an optimal
label sequence y*, i.e. one that maximizes the posterior
probability given x. Unfortunately, its run-time is in too many
cases not acceptable. That is why we transform it in a
second step into a feasible, time-bounded approximation
algorithm.

The generalized Viterbi algorithm 

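Let $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ be the neighborhood graph underlying the
protein under study. For any assignment (label sequence)
$\mathbf{y}$ and any subset $\mathcal{V}'$ of $\mathcal{V}$, let $\mathbf{y}_{\mathcal{V}'}$ denote the partial assignment
of $\mathbf{y}$ with respect to $\mathcal{V}'$. (This is in line with the
notation $\mathbf{y}_c$ ($c$ a position clique) introduced earlier in this
study.)

If $\mathcal{V}_1, \mathcal{V}_2, \ldots, \mathcal{V}_r$ are pairwise disjoint position sets, the
assignment for $\mathcal{V}_1 \cup \mathcal{V}_2 \cup \ldots \cup \mathcal{V}_r$ canonically resulting from
assignments $\mathbf{y}_{\mathcal{V}_1}, \mathbf{y}_{\mathcal{V}_2}, \ldots, \mathbf{y}_{\mathcal{V}_r}$ is denoted by
$\mathbf{y}_{\mathcal{V}_1} \cup \mathbf{y}_{\mathcal{V}_2} \cup \ldots \cup \mathbf{y}_{\mathcal{V}_r}$.
For $\mathcal{V}' \subseteq \mathcal{V}$, the score $s_{\mathcal{V}'}(\mathbf{y}_{\mathcal{V}'} \mid \mathbf{x})$ is defined by

$$s_{\mathcal{V}'}(\mathbf{y}_{\mathcal{V}'} \mid \mathbf{x}) := \sum_{i \in \mathcal{V}'} \Phi_i(y_i, \mathbf{x}) + \sum_{\{i,j\} \in \mathcal{E},\ i, j \in \mathcal{V}'} \Phi_{i,j}(y_i, y_j, \mathbf{x}).$$

Then the problem of determining a most probable label
sequence $\mathbf{y}^*$ given an observation $\mathbf{x}$ can be reformulated as

$$\mathbf{y}^* = \operatorname*{argmax}_{\mathbf{y}} \; s_{\mathcal{V}}(\mathbf{y} \mid \mathbf{x}).$$

This is the case because, for a pCRF, $s_{\mathcal{V}}(\mathbf{y} \mid \mathbf{x}) = \ln(Z(\mathbf{x}) P(\mathbf{y} \mid \mathbf{x}))$
and $Z(\mathbf{x})$ does not depend on $\mathbf{y}$, so it suffices to consider the score.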

To put this into practice, we devised an algorithm we
call generalized Viterbi. On the one hand, it is analogous
to the classical Viterbi algorithm. On the other hand, there
is a major difference. In our case there is no canonical
order in which the positions of $\mathcal{G}$ are traversed. Having
explained our algorithm for any order, we show how to
calculate a fairly effective one. In what follows, we assume
that the positions not yet touched are held in a dynamic
queue. Those positions having already left the queue form
the history set $\mathcal{H} \subseteq \mathcal{V}$.

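Assume that the subgraph of $\mathcal{G}$ induced by $\mathcal{H}$ has
connected components $\mathcal{H}_1, \mathcal{H}_2, \ldots, \mathcal{H}_m$. For $\mu =
1, 2, \ldots, m$, let $\mathcal{B}_\mu \subseteq \mathcal{H}_\mu$ be the so-called boundary
component associated with $\mathcal{H}_\mu$ defined by
$\mathcal{B}_\mu := \{ i \in \mathcal{H}_\mu \mid \exists j \notin \mathcal{H},\ \{i, j\} \in \mathcal{E} \}$.
The complement $\mathcal{H}_\mu \setminus \mathcal{B}_\mu$
is the interior of the $\mu$-th history component. See Figure 1
for an example.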
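The decomposition of a history set into connected components and their boundary components can be sketched as below. The adjacency-list representation and the function name `history_components` are assumptions made for illustration only.

```python
def history_components(history, adj):
    """Connected components of the subgraph induced by `history`.

    Returns a list of (component, boundary) pairs of sets, where the
    boundary holds the component nodes with a neighbor outside `history`.
    """
    seen = set()
    result = []
    for start in sorted(history):
        if start in seen:
            continue
        component = set()
        stack = [start]
        while stack:  # depth-first search restricted to history nodes
            u = stack.pop()
            if u in component:
                continue
            component.add(u)
            stack.extend(w for w in adj[u] if w in history and w not in component)
        seen |= component
        boundary = {i for i in component if any(j not in history for j in adj[i])}
        result.append((component, boundary))
    return result


# Toy path graph a - b - c - d - e with history set {a, b, d}.
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "e"], "e": ["d"]}
parts = history_components({"a", "b", "d"}, adj)
```

In this toy example node a is interior to its component, because its only neighbor b already belongs to the history set.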

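For assignments $\mathbf{y}_{\mathcal{B}_1}, \mathbf{y}_{\mathcal{B}_2}, \ldots, \mathbf{y}_{\mathcal{B}_m}$ of the boundary
components $\mathcal{B}_1, \mathcal{B}_2, \ldots, \mathcal{B}_m$, the Viterbi variables
$\mathrm{vit}_{\mathcal{H}_1}(\mathbf{y}_{\mathcal{B}_1}), \mathrm{vit}_{\mathcal{H}_2}(\mathbf{y}_{\mathcal{B}_2}), \ldots, \mathrm{vit}_{\mathcal{H}_m}(\mathbf{y}_{\mathcal{B}_m})$ are defined as

$$\mathrm{vit}_{\mathcal{H}_\mu}(\mathbf{y}_{\mathcal{B}_\mu}) := \max_{\mathbf{y}_{\mathcal{H}_\mu \setminus \mathcal{B}_\mu}} s_{\mathcal{H}_\mu}\big(\mathbf{y}_{\mathcal{H}_\mu \setminus \mathcal{B}_\mu} \cup \mathbf{y}_{\mathcal{B}_\mu} \mid \mathbf{x}\big) \qquad (\mu = 1, 2, \ldots, m). \tag{6}$$

The Viterbi variables can be represented as a set of
tables, one table of size $2^{|\mathcal{B}_\mu|}$ for each boundary component
$\mathcal{B}_\mu$. In the case where a boundary component is
empty the table reduces to a single number.

At any stage, the algorithm stores the connected
components $\mathcal{H}_1, \mathcal{H}_2, \ldots, \mathcal{H}_m$ of the current history
set $\mathcal{H}$, the corresponding boundary components
$\mathcal{B}_1, \mathcal{B}_2, \ldots, \mathcal{B}_m$, and Viterbi variable values
$\mathrm{vit}_{\mathcal{H}_1}(\mathbf{y}_{\mathcal{B}_1}), \mathrm{vit}_{\mathcal{H}_2}(\mathbf{y}_{\mathcal{B}_2}), \ldots, \mathrm{vit}_{\mathcal{H}_m}(\mathbf{y}_{\mathcal{B}_m})$,
where $\mathbf{y}_{\mathcal{B}_1}, \mathbf{y}_{\mathcal{B}_2}, \ldots, \mathbf{y}_{\mathcal{B}_m}$
range over all possible assignments of the corresponding
boundary component. We store for every assignment on
the boundary a maximizing interior assignment. This
assignment is the argmax of (6) but is determined with
the dynamic programming recursions defined below. Let
us call these data the current state of the algorithm. It
mainly consists of record sets indexed by the boundary
labelings.

Figure 1 Example history set $\mathcal{H} = \{a, b, c, d\}$ having boundary

At the very beginning the queue contains all positions;
the history set $\mathcal{H}$ and the corresponding boundary component
$\mathcal{B}$ are empty. As long as the position queue is
not empty, the top element $v$ is extracted and the state is
updated as follows.

When adjoining $v$ to the history set $\mathcal{H}$, there are two cases to
distinguish. Either position $v$ is not adjacent to any
position of any old boundary component (see Figure 2), or
adjoining position $v$ to $\mathcal{H}$ results in adding it to some connected
component of the old history set or even merging
together two or more of them (see Figure 3).

In the first case we simply have to take over all the
old connected components, boundary sets and Viterbi
variables. Moreover, we perform the instructions

$$\mathcal{H}_{m+1} := \{v\}, \quad \mathcal{B}_{m+1} := \{v\}, \quad \mathrm{vit}_{\mathcal{H}_{m+1}}(\mathrm{I}) := s_{\mathcal{H}_{m+1}}(\mathrm{I} \mid \mathbf{x}), \quad \mathrm{vit}_{\mathcal{H}_{m+1}}(\mathrm{N}) := s_{\mathcal{H}_{m+1}}(\mathrm{N} \mid \mathbf{x}).$$
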

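In the second case position $v$ is adjacent to some boundary
components, say $\mathcal{B}_{m'}, \mathcal{B}_{m'+1}, \ldots, \mathcal{B}_m$. Then the old
history components $\mathcal{H}_{m'}, \mathcal{H}_{m'+1}, \ldots, \mathcal{H}_m$ and the current
position $v$ are merged together:

$$\mathcal{H}_{\mathrm{tmp}} := \mathcal{H}_{m'} \cup \mathcal{H}_{m'+1} \cup \ldots \cup \mathcal{H}_m \cup \{v\}.$$

The other history set components and corresponding
Viterbi variables are not affected.

For $\mu = m', m'+1, \ldots, m$, let $\mathcal{R}_\mu \subseteq \mathcal{B}_\mu$ be the set of
all positions out of $\mathcal{B}_\mu$ that are no longer boundary nodes
after having adjoined $v$ to the history set. The nodes in
$\mathcal{R}_\mu$ are removed from the boundary after the iteration.
Let $\mathcal{B}'_\mu$ be the complement of $\mathcal{R}_\mu$ in $\mathcal{B}_\mu$. By inspecting the
edges incident to the current position $v$, all these sets can
be computed in linear time.

The new boundary set $\mathcal{B}_{\mathrm{tmp}}$ is then either
$\mathcal{B}'_{m'} \cup \mathcal{B}'_{m'+1} \cup \ldots \cup \mathcal{B}'_m$ or
$\mathcal{B}'_{m'} \cup \mathcal{B}'_{m'+1} \cup \ldots \cup \mathcal{B}'_m \cup \{v\}$, where it can be
checked in linear time whether or not $v$ is a new boundary
position.

We are now in a position to calculate the new Viterbi
variables $\mathrm{vit}_{\mathcal{H}_{\mathrm{tmp}}}(\mathbf{y}_{\mathcal{B}_{\mathrm{tmp}}})$, where $\mathbf{y}_{\mathcal{B}_{\mathrm{tmp}}}$ ranges over all
assignments of the new boundary set $\mathcal{B}_{\mathrm{tmp}}$.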

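If $v \notin \mathcal{B}_{\mathrm{tmp}}$ then

$$\mathrm{vit}_{\mathcal{H}_{\mathrm{tmp}}}(\mathbf{y}_{\mathcal{B}_{\mathrm{tmp}}}) := \max_{y_v,\ \mathbf{y}_{\mathcal{R}_{m'}}, \ldots, \mathbf{y}_{\mathcal{R}_m}} \Big( \Phi_v(y_v, \mathbf{x}) + \sum_{\{v,w\} \in \mathcal{E}} \Phi_{v,w}(y_v, y_w, \mathbf{x}) + \sum_{\mu=m'}^{m} \mathrm{vit}_{\mathcal{H}_\mu}(\mathbf{y}_{\mathcal{B}_\mu}) \Big).$$

Here, any assignment of a node set is assumed to implicitly
define assignments for any subset thereof. Figure 4
illustrates this case of the recursion step. If, however, $v \in
\mathcal{B}_{\mathrm{tmp}}$, then

$$\mathrm{vit}_{\mathcal{H}_{\mathrm{tmp}}}(\mathbf{y}_{\mathcal{B}_{\mathrm{tmp}}}) := \Phi_v(y_v, \mathbf{x}) + \max_{\mathbf{y}_{\mathcal{R}_{m'}}, \ldots, \mathbf{y}_{\mathcal{R}_m}} \Big( \sum_{\{v,w\} \in \mathcal{E},\ w \in \mathcal{H}_{\mathrm{tmp}}} \Phi_{v,w}(y_v, y_w, \mathbf{x}) + \sum_{\mu=m'}^{m} \mathrm{vit}_{\mathcal{H}_\mu}(\mathbf{y}_{\mathcal{B}_\mu}) \Big).$$

Finally, the interior labeling is stored, where the maximum
is attained. The algorithm terminates after the last
node $v$ from $\mathcal{V}$ has been processed. In the typical case,
where the graph is connected, at termination $m = 1$, $\mathcal{H}_1 =
\mathcal{V}$, $\mathcal{B}_1 = \emptyset$.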

The running time of the algorithm is $O(n 2^b)$, where $b$
is the size of the largest boundary set and $n$ is the number
of surface residues. We call this algorithm the generalized
Viterbi algorithm because, for a graph that is a linear
chain $1 - 2 - 3 - \cdots - n$ of nodes and the node order
$1, 2, \ldots, n$, the Viterbi variables we define are the same as in
the standard Viterbi algorithm for HMMs. In the case of a
graph that is a tree, this algorithm specializes to the Fitch
algorithm or an argmax-version of Felsenstein's pruning
algorithm when a leaf-to-root node order is chosen after
rooting the tree at an arbitrary node. In both special cases
the boundary sets always have size at most 1. The tree
example also motivates the use of several history sets at the
same time: using a single history set only, one would not
be able to achieve a linear running time on trees.
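The linear-chain special case mentioned above can be illustrated with a short sketch in which the boundary always consists of the most recently processed node. The feature values are made up, and the function names (`chain_viterbi`, `chain_score`) are hypothetical; this is the classical Viterbi recursion on scores, not the authors' generalized implementation.

```python
import itertools

LABELS = ("I", "N")


def chain_viterbi(node_feat, edge_feat):
    """Best labeling of the chain 0 - 1 - ... - (n-1).

    node_feat[i][y] is the node feature of position i, and
    edge_feat[i][(y, y2)] is the edge feature of the edge {i, i+1}.
    """
    vit = {y: node_feat[0][y] for y in LABELS}  # boundary = {0}
    back = []
    for i in range(1, len(node_feat)):
        new_vit, ptr = {}, {}
        for y in LABELS:
            # Maximize over the label of node i-1, which becomes interior.
            prev = max(LABELS, key=lambda p: vit[p] + edge_feat[i - 1][(p, y)])
            new_vit[y] = vit[prev] + edge_feat[i - 1][(prev, y)] + node_feat[i][y]
            ptr[y] = prev
        vit = new_vit
        back.append(ptr)
    last = max(LABELS, key=lambda y: vit[y])
    path = [last]
    for ptr in reversed(back):  # follow the stored maximizing labels
        path.append(ptr[path[-1]])
    path.reverse()
    return path, vit[last]


def chain_score(path, node_feat, edge_feat):
    """Score of a complete labeling of the chain."""
    total = sum(node_feat[i][y] for i, y in enumerate(path))
    total += sum(edge_feat[i][(path[i], path[i + 1])] for i in range(len(path) - 1))
    return total


node_feat = [
    {"I": 1.0, "N": 0.0},
    {"I": -0.2, "N": 0.1},
    {"I": 0.9, "N": 0.0},
    {"I": 0.0, "N": 1.5},
]
smooth = {("I", "I"): 0.4, ("N", "N"): 0.4, ("I", "N"): -0.4, ("N", "I"): -0.4}
edge_feat = [smooth] * 3
best_path, best_score = chain_viterbi(node_feat, edge_feat)
```

Each table here has $2^1$ entries, which matches the bound $b = 1$ stated for chains.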

A heuristic based on the generalized Viterbi algorithm 

First, it is vital for our generalized Viterbi algorithm to 
keep the size of the boundary sets small. A good position 
order is here of great importance. The algorithm starts by 
choosing a vertex of minimal degree. When determining 
the next position to be dequeued, the algorithm selects a 
boundary node such that the number of incident edges 
leading to nodes not belonging to any current history set 
is minimal. These not yet visited neighbors of the selected boundary node
are then dequeued next, in an arbitrarily chosen order.
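One possible reading of this ordering rule is sketched below. The interpretation that the unvisited neighbors of the selected boundary node are dequeued together, the tie-breaking by node identifier, and the restart with a minimal-degree node when no boundary node exists are assumptions of this sketch; `position_order` is a hypothetical name.

```python
def position_order(adj):
    """Return a processing order of all nodes of the graph `adj`.

    adj maps each node to the list of its neighbors (node ids must be
    mutually comparable, e.g. integers).
    """
    order, history = [], set()
    while len(order) < len(adj):
        # Boundary nodes: visited nodes with at least one unvisited neighbor.
        boundary = [u for u in order if any(w not in history for w in adj[u])]
        if boundary:
            # Pick the boundary node with the fewest edges leaving the history.
            b = min(boundary, key=lambda u: (sum(w not in history for w in adj[u]), u))
            batch = sorted(w for w in adj[b] if w not in history)
        else:
            # Start (or restart in a new component) at a minimal-degree node.
            rest = [u for u in adj if u not in history]
            batch = [min(rest, key=lambda u: (len(adj[u]), u))]
        order.extend(batch)
        history.update(batch)
    return order


path_graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
star_graph = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
path_order = position_order(path_graph)
star_order = position_order(star_graph)
```

After the unvisited neighbors of the selected node have been dequeued, that node becomes interior, which is what keeps the boundary sets small.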

Second, the space demand is reduced by restricting the 
number of boundary labelings admitted. Starting from the 
available labelings of the current history set, the percent- 
age of the reachable boundary labelings of the successor 
history that will be discarded is calculated. Then the 
corresponding percentile is estimated. To this end, a suf- 
ficiently large sample of possible labelings of the new 
boundary set is drawn, the Viterbi variables are computed, 
and the corresponding sample percentile is taken. Finally, 
only those boundary labelings of the new history set are 
retained whose Viterbi variables exceed this percentile. 

That way we compute near-optimal solutions good 
enough for our purposes within feasible computation 
time. 




Figure 3 Computing the connected components of the new history set - case two.



Dong et al. BMC Bioinformatics 2014, 15:277
http://www.biomedcentral.com/1471-2105/15/277







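Figure 4 Viterbi recursion step in case $v \notin \mathcal{B}_{\mathrm{tmp}}$. After adding node $v$ to the history set, $\mathcal{H}_2$ and $\mathcal{H}_3$ will be replaced by $\mathcal{H}_{\mathrm{tmp}} = \mathcal{H}_2 \cup \mathcal{H}_3 \cup \{v\}$.
In this example, for every assignment of the new boundary $\mathcal{B}_{\mathrm{tmp}}$ the score of $\mathcal{H}_{\mathrm{tmp}}$ is maximized by varying over the assignments of $v$ and $\mathcal{R}_3$ and
using the Viterbi variables of $\mathcal{H}_2$ and $\mathcal{H}_3$.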



Piecewise training for pCRFs 

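Let

$$\mathcal{D} = \big( (\mathbf{x}^{(\mu)}, \mathbf{y}^{(\mu)}) \big)_{\mu = 1, 2, \ldots, m}$$

be the independent identically distributed training sample.
For every $\mu = 1, 2, \ldots, m$ let $\mathcal{V}_\mu$ and $\mathcal{E}_\mu$ be the set
of positions and edges in the neighborhood graph associated
with $\mathbf{x}^{(\mu)}$, let $n_\mu = |\mathcal{V}_\mu|$ be the number of positions of
the $\mu$-th training example and let $\{\mathrm{I}, \mathrm{N}\}^{n_\mu}$ be the set of all
possible label sequences of this graph.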

This data set is unbalanced as there are many more non-interface
positions than interface positions. As customary
for other machine learning approaches such as support 
vector machines and artificial neural networks [28], we 
here manipulated the ratio of positive and negative exam- 
ple positions for training in order to obtain reasonable 
results. 

We have amplified the influence of the positive exam- 
ples, rather than selecting various sets of training data by 
deleting negative ones as done in [28]. 

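Let $\nu_{\mathrm{I}}$, $\nu_{\mathrm{N}}$, $\nu_{\mathrm{II}}$ and $\nu_{\mathrm{NN}}$ be the number of interface positions,
the number of non-interface positions, the number
of interface-interface edges, and the number of non-interface-non-interface
edges in $\mathcal{D}$, respectively. Then we
define the following two amplifier functions for all positions
$i$ and for all edges $\{i, j\}$ of the $m$ neighborhood graphs
resulting from the training data $\mathcal{D}$.

$$\varpi(i) := \begin{cases} \dfrac{\nu_{\mathrm{N}}}{\nu_{\mathrm{I}}} - 1 & \text{if } y_i = \mathrm{I}; \\ 0 & \text{if } y_i = \mathrm{N}. \end{cases}$$

$$\varpi(i, j) := \begin{cases} \dfrac{\nu_{\mathrm{NN}}}{\nu_{\mathrm{II}}} - 1 & \text{if } y_i = y_j = \mathrm{I}; \\ 1 & \text{if } y_i \neq y_j; \\ 0 & \text{if } y_i = y_j = \mathrm{N}. \end{cases}$$
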

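To uniformly govern the influence of the amplifiers, we
introduce an amplifier control parameter $\eta \in [0, 1]$.
We set up our log-likelihood objective function by

$$\mathcal{L}(\boldsymbol{\alpha}, \boldsymbol{\beta}) := \sum_{\mu=1}^{m} \Big[ \sum_{i \in \mathcal{V}_\mu} (1 + \eta\,\varpi(i)) \sum_{k=1}^{K_1} \alpha_k \varphi_k\big(i, y_i^{(\mu)}, \mathbf{x}^{(\mu)}\big) + \sum_{\{i,j\} \in \mathcal{E}_\mu} (1 + \eta\,\varpi(i,j)) \sum_{k=1}^{K_2} \beta_k \psi_k\big(i, j, y_i^{(\mu)}, y_j^{(\mu)}, \mathbf{x}^{(\mu)}\big) - \ln Z\big(\mathbf{x}^{(\mu)}, \eta\big) \Big],$$

where ideally for each $\mu = 1, 2, \ldots, m$

$$Z\big(\mathbf{x}^{(\mu)}, \eta\big) := \sum_{\mathbf{y} \in \{\mathrm{I}, \mathrm{N}\}^{n_\mu}} \exp\Big( \sum_{i \in \mathcal{V}_\mu} (1 + \eta\,\varpi(i)) \sum_{k=1}^{K_1} \alpha_k \varphi_k\big(i, y_i, \mathbf{x}^{(\mu)}\big) + \sum_{\{i,j\} \in \mathcal{E}_\mu} (1 + \eta\,\varpi(i,j)) \sum_{k=1}^{K_2} \beta_k \psi_k\big(i, j, y_i, y_j, \mathbf{x}^{(\mu)}\big) \Big)$$

is the training-instance-specific normalization factor.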

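Unfortunately, maximizing this objective function in
general is algorithmically intractable. Following the example of
Sutton et al. [35], who introduced what they called piecewise
training, we deal with this problem by disentangling
the labels of nodes and edges. For $\mu = 1, 2, \ldots, m$, a non-coherent
labeling $\mathbf{y} \in \{\mathrm{I}, \mathrm{N}\}^{\mathcal{V}_\mu} \times (\{\mathrm{I}, \mathrm{N}\}^2)^{\mathcal{E}_\mu}$ of the neighborhood
graph $(\mathcal{V}_\mu, \mathcal{E}_\mu)$ is any mapping that assigns to every position
$v \in \mathcal{V}_\mu$ and every edge $e \in \mathcal{E}_\mu$ a label $y_v \in \{\mathrm{I}, \mathrm{N}\}$ and a
pair of labels $\mathbf{y}_e \in \{\mathrm{I}, \mathrm{N}\}^2$, respectively.
We then replace $Z(\mathbf{x}^{(\mu)}, \eta)$ by

$$\sum_{\mathbf{y}\ \text{non-coherent}} \exp\Big( \sum_{v \in \mathcal{V}_\mu} (1 + \eta\,\varpi(v)) \sum_{k=1}^{K_1} \alpha_k \varphi_k\big(v, y_v, \mathbf{x}^{(\mu)}\big) + \sum_{e \in \mathcal{E}_\mu} (1 + \eta\,\varpi(e)) \sum_{k=1}^{K_2} \beta_k \psi_k\big(e, \mathbf{y}_e, \mathbf{x}^{(\mu)}\big) \Big)$$

as normalization factor. This makes the optimization
problem computationally feasible.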

The L-BFGS method [38] is used to solve it. That way we
obtain the coefficient vectors $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$ (see Equations 3),
which depend on the amplifier control parameter $\eta \in
[0, 1]$.

To mitigate the negative consequences of disentanglement,
we use a correction factor $\delta > 1$. For any characteristic
$D$ and $\iota = 0, 1, \ldots, \gamma - 1$, the weights of the base
edge features $\psi^{(D)}_{\mathrm{N},\mathrm{I},\iota}$ and $\psi^{(D)}_{\mathrm{I},\mathrm{N},\iota}$ (see Equation 5) are all multiplied
by $\delta$. Thus a change in classification along an edge is
additionally penalized. The correction factor $\delta$ is best set
between 1.15 and 1.25.

For our implementation of the training, we used the
Java CRF package from Sunita Sarawagi at
http://crf.sourceforge.net/.

Results and discussion 

In this section we demonstrate the effectiveness of our pCRF-based
protein surface model to enhance residue-wise
score-based predictions of protein-protein interfaces. To
ensure the reliability of the methods, we used
three data sets. The first one is PlaneDimers due to Zellner
et al. [25], the second one is the list of 1276 two-chain
proteins published by Keskin et al. [39], which was used by
Li et al. [28] to test their linear-chain CRF. Third, we used
a non-redundant data set containing 22604 unique interface
structures very recently compiled by Cukuroglu et al.
and published in [40].

The data set PlaneDimers is less known than the
data due to Keskin et al. It consists of redundancy-free
homodimers with flat protein-protein interfaces. Zellner
et al. [25] developed an SVM, called PresCont, that assigns
to each residue on the protein surface a score between 0
and 1, which we refer to as PresCont score in the sequel.
The larger the score, the more likely the residue belongs
to the interface. Zellner et al. made the prediction by
thresholding the score. The PresCont server and the data
list PlaneDimers are publicly available (see
http://www-bioinf.uni-regensburg.de/).

In the first subsection we describe two sets of experiments
performed with synthetic data, one on PlaneDimers
[25], the other one on the list published by
Keskin et al. [39]. In both cases we independently
assign to each surface position a random score drawn
according to two different parametrized sequences of
$\beta$-distributions $\mathrm{Beta}(\alpha_{\mathrm{I}}(g), \beta_{\mathrm{I}}(g))$ and $\mathrm{Beta}(\alpha_{\mathrm{N}}(g), \beta_{\mathrm{N}}(g))$,
one for the interface sites determined by the reference
labeling, the other one for the non-interface positions.
The parametrized values $\alpha_{\mathrm{I}}(g)$, $\alpha_{\mathrm{N}}(g)$, $\beta_{\mathrm{I}}(g)$ and $\beta_{\mathrm{N}}(g)$
determining the two sequences of distributions are chosen
such that the following conditions are satisfied. The
mean values $e_{\mathrm{I}} > e_{\mathrm{N}}$ are the average PresCont scores on
interface sites and non-interface sites of all chains from
PlaneDimers. The variances $\sigma_{\mathrm{I}}^2$ and $\sigma_{\mathrm{N}}^2$ are equal to $\sigma_{\mathrm{I},0}^2\, g$
and $\sigma_{\mathrm{N},0}^2\, g$, where $\sigma_{\mathrm{I},0}^2$ and $\sigma_{\mathrm{N},0}^2$ are the corresponding variances
of the PresCont score, and $g \in \{0.8, 0.9, 1.0, 1.1, 1.2\}$
models the precision of the synthetic score. The deciding
feature of all these distributions is that they are unimodal.
The result of the subsection is that enhancement works
for unimodal score distributions.
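A Beta distribution is fixed by its mean and variance, so the conditions above determine the parameters by moment matching. The sketch below shows this under stated assumptions: the numerical means and variances are made-up placeholders (not the PresCont statistics), and the function names (`beta_params`, `synthetic_scores`) are hypothetical.

```python
import random


def beta_params(mean, var):
    """Beta parameters (a, b) with the given mean and variance."""
    if not 0.0 < mean < 1.0:
        raise ValueError("mean must lie strictly between 0 and 1")
    common = mean * (1.0 - mean) / var - 1.0
    if common <= 0.0:
        raise ValueError("variance too large for a Beta distribution")
    return mean * common, (1.0 - mean) * common


def synthetic_scores(labels, mean_i, var_i, mean_n, var_n, g, rng):
    """Draw one score per position; variances are scaled by the factor g."""
    a_i, b_i = beta_params(mean_i, var_i * g)
    a_n, b_n = beta_params(mean_n, var_n * g)
    return [
        rng.betavariate(a_i, b_i) if y == "I" else rng.betavariate(a_n, b_n)
        for y in labels
    ]


# Placeholder statistics, chosen for illustration only.
a, b = beta_params(0.6, 0.02)
rng = random.Random(0)
scores = synthetic_scores(["I", "N", "N", "I"], 0.6, 0.02, 0.3, 0.02, 1.0, rng)
```

With both parameters larger than 1, as in this example, the resulting Beta density has a single interior mode.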

The second subsection is about a synthetic data experiment
on a new data set due to Cukuroglu [40]. Here we follow the
line of the first subsection except that we restrict ourselves
to signal precision $g = 1.0$.

In the third subsection we study the PresCont scores for
two-chain protein complexes from the data set PlaneDimers.
According to Figure 5, the PresCont score for non-interface
residues is far from being unimodal. However, if one restricts
oneself to the part of the score domain above a threshold in
the neighborhood of 0.5 or larger, one may ask whether
enhancement restricted to that domain works. The subsection
answers this question in the affirmative. Having chosen a
threshold as described above, one can improve the
classification with respect to this threshold as follows. Take
over the prediction for scores below the threshold and
reclassify the residues whose scores are above it by means of
the pCRF-based enhancer.

In general, observations $x$ could encompass a PDB file,
which in particular determines the 3D-structure of the
protein, together with an MSA that models evolutionary
aspects. In our case an observation solely consists of the
PresCont score sequence or of the sequence of synthetic
scores for the surface residues. Formally, every observation
$x$ is equal to a vector $(\xi_1, \ldots, \xi_n) \in [0, 1]^n$.

There are several neighborhood notions for residues, 
surface/core definitions and interface determinations in 
the literature. When studying the data set PlaneDimers, 
we follow [25]. In the case of the list due to Keskin et al. 
[39], the definitions according to [28] are used. Finally, 
when studying complexes taken from the data set pub- 
lished in [40], we take the following definitions. The RASA 
value of a surface residue is at least 15% (see [28]). Two 
residues are defined as contacting if the distance between
any two of their atoms is less than the sum of the
corresponding van der Waals radii plus 0.5 Å (see [40]).

Following Keskin et al. [39], we define the distance of two
residues on one and the same chain as the distance of their
major carbon atoms. We then say that one residue is nearby
another residue if they are at a distance below 6 Å. (Note
that residues adjacent on the backbone are usually at a
distance of at most 3.5 Å.) This definition in turn is the
basis of the neighborhood graph $\mathcal{G} = (V, \mathcal{E})$
underlying the pCRF. Two surface positions are joined by an
undirected edge if and only if the corresponding residues are
nearby ones.
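
The construction of the neighborhood graph can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: the function name `neighborhood_graph` and the input format (one representative atom coordinate per surface residue, in Å) are assumptions made here for the example.

```python
import math
from itertools import combinations


def neighborhood_graph(coords, cutoff=6.0):
    """Return the undirected edge set {(i, j) : i < j, distance < cutoff}.

    coords: list of (x, y, z) tuples in angstroms, one representative
    atom per surface residue (hypothetical input format).
    """
    edges = set()
    for i, j in combinations(range(len(coords)), 2):
        if math.dist(coords[i], coords[j]) < cutoff:
            edges.add((i, j))
    return edges


# Toy coordinates, invented for illustration only.
coords = [(0.0, 0.0, 0.0), (3.5, 0.0, 0.0), (9.0, 0.0, 0.0), (3.5, 4.0, 0.0)]
print(sorted(neighborhood_graph(coords)))
```

The quadratic loop over all pairs is adequate for a single protein surface; a spatial index would only be needed for much larger inputs.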

Our pCRF-based enhancer utilizes one position characteristic
and two edge characteristics on the basis of the standard
step function method explained in the Methods section. If
$x = (\xi_1, \xi_2, \ldots, \xi_n) \in [0, 1]^n$ is the observation
associated with the protein under study, and
$\mathcal{G} = (V, \mathcal{E})$ is the neighborhood graph, then
for every position $i \in V$ and every edge
$\{i, j\} \in \mathcal{E}$ we set

$$C(i, x) := \xi_i, \qquad D_1(i, j, x) := \max\{\xi_i, \xi_j\}, \qquad D_2(i, j, x) := |\xi_i - \xi_j|.$$

To enhance predictions obtained by thresholding, solely 
information coming from the residue neighborhood rela- 
tions on the surface is additionally used. 
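
As a small illustration of how little information enters the model, the three characteristics can be computed directly from the score vector and the edge list. This is a sketch under the assumption that the position characteristic is the score itself and that the edge characteristics are the maximum and the absolute difference of the two scores; the function names are hypothetical.

```python
def position_feature(i, x):
    """Position characteristic: the score of residue i (assumed reading)."""
    return x[i]


def edge_features(i, j, x):
    """Edge characteristics: maximum and absolute difference of the scores."""
    return max(x[i], x[j]), abs(x[i] - x[j])


# Toy scores and edges, invented for illustration only.
x = [0.7, 0.4, 0.9]
edges = [(0, 1), (1, 2)]
for i, j in edges:
    print((i, j), edge_features(i, j, x))
```

In the step function method these real values would subsequently be discretized; that step is not shown here.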

In order to be able to calculate the performance measure
of area under the ROC curve (AUC) for our pCRF-based
enhancer on synthetic scores, we proceed as follows. For
each edge $\{i, j\} \in \mathcal{E}$, we replace the local feature
value $\Phi_{\{i,j\}}(I, I, x)$ by $\kappa \cdot \Phi_{\{i,j\}}(I, I, x)$,
where $\kappa \in (0, \infty)$.

We enhance residue-wise score-based predictors only
on the protein surface. In our synthetic data experiments
there is no predictor available for core residues. For
proteins taken from the data list published by Keskin et al.
[39] it happens that interface sites belong to the core. That
is why we use what we call the Surface AUC Ratio $\Gamma$ of
the enhancer as our performance measure for our synthetic
data experiments:

$$\Gamma := \frac{\text{AUC referred to the protein surface of the enhancer}}{\text{AUC referred to the protein surface of the residue-wise score-based threshold predictor}}$$

If $\Gamma$ is greater than 1, the enhancement was successful.
The larger $\Gamma$, the greater the success.
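
A minimal sketch of how such a ratio could be computed from ROC points is given below. It assumes that each predictor is summarized by a list of (false-positive rate, true-positive rate) pairs, for example one pair per threshold or per value of the edge weight; the trapezoidal rule, the added end points (0, 0) and (1, 1), and the function names are choices made for this illustration and are not taken from the paper.

```python
def auc_from_roc(points):
    """Trapezoidal area under ROC points given as (fpr, tpr) pairs.

    The end points (0, 0) and (1, 1) are added automatically.
    """
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(pts, pts[1:]))


def surface_auc_ratio(enhancer_roc, threshold_roc):
    """Ratio of the enhancer AUC to the threshold predictor AUC."""
    return auc_from_roc(enhancer_roc) / auc_from_roc(threshold_roc)


# Toy ROC points, invented for illustration only.
threshold_roc = [(0.5, 0.5)]
enhancer_roc = [(0.2, 0.6)]
print(surface_auc_ratio(enhancer_roc, threshold_roc))
```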

To estimate performance measures, we applied 5-fold 
cross-validation experiments. 

A fully built-out pCRF-based tool box for modeling
protein surfaces needs to comprise all the standard
algorithms known for HMMs and linear-chain CRFs, such as
forward-backward techniques, marginalization and posterior
decoding. To begin with, in the fourth subsection we explain
how to put a variant form of the forward algorithm and
posterior decoding for pCRFs into practice.

Simulating unimodal scores of various precisions 

We estimated the means $e_I$ and $e_N$ and the variances
$\sigma_{0,I}^2$ and $\sigma_{0,N}^2$ of the PresCont score on
interface sites and non-interface positions of PlaneDimers,
respectively, as follows.

$$e_I = 0.61488, \qquad e_N = 0.40590, \qquad \sigma_{0,I}^2 = 0.03991, \qquad \sigma_{0,N}^2 = 0.04006 \qquad (7)$$

We randomly chose 120 instances under the uniform
distribution from the data set published by Keskin et al.
[39] to perform our experiments. Let us refer to this set
as KL-subset in the sequel. (It is accessible at
http://ppicrf.informatik.uni-goettingen.de/index.html.)

Zellner et al. [25] used the following determinations. A
residue is defined to be part of the protein surface if its
relative solvent-accessible surface area is at least 5% [17].
A surface residue is said to constitute an interfacial
contact if there exists at least one atom of this residue
whose van der Waals sphere is at a distance of at most 0.5 Å
from the van der Waals sphere of any atom of a partner
chain residue [39].

Based on [3,12,15,20,41], Li et al. [28] assume that a residue
on a chain forms an interfacial contact if any heavy atom of
this residue is at a distance of at most 5 Å from any heavy
atom of a partner chain. The relative solvent-accessible
surface area of a surface residue is at least 15%.

We independently assigned to each interface surface
residue of the two data sets a random score between zero
and one according to the β-distribution
$\mathrm{Beta}(\alpha_I(g), \beta_I(g))$, and to every non-interface
surface residue a score according to
$\mathrm{Beta}(\alpha_N(g), \beta_N(g))$, where the score precision
$g$ satisfies

$$g \in \{0.8, 0.9, 1.0, 1.1, 1.2\}, \qquad (8)$$

and the parameters $\alpha_I(g)$, $\beta_I(g)$, $\alpha_N(g)$,
$\beta_N(g)$ were chosen such that

$$e_I = \frac{\alpha_I(g)}{\alpha_I(g) + \beta_I(g)}, \qquad \sigma_{0,I}^2\, g = \frac{\alpha_I(g)\,\beta_I(g)}{(\alpha_I(g) + \beta_I(g))^2\,(\alpha_I(g) + \beta_I(g) + 1)}, \qquad (9)$$

$$e_N = \frac{\alpha_N(g)}{\alpha_N(g) + \beta_N(g)}, \qquad \sigma_{0,N}^2\, g = \frac{\alpha_N(g)\,\beta_N(g)}{(\alpha_N(g) + \beta_N(g))^2\,(\alpha_N(g) + \beta_N(g) + 1)}. \qquad (10)$$
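
The mean and variance conditions determine the two β-distribution parameters uniquely. The following sketch solves them by the method of moments and draws synthetic scores with Python's standard library. The estimates are those reported for the PresCont score on PlaneDimers; the function names, the label encoding ("I" and "N") and the use of `random.betavariate` are assumptions made for this illustration.

```python
import random

# Estimated PresCont score statistics on PlaneDimers (from the text).
E_I, E_N = 0.61488, 0.40590
VAR0_I, VAR0_N = 0.03991, 0.04006


def beta_params(mean, variance):
    """Solve the Beta mean and variance equations for (alpha, beta)."""
    nu = mean * (1.0 - mean) / variance - 1.0  # nu = alpha + beta
    if nu <= 0.0:
        raise ValueError("variance too large for a Beta distribution")
    return mean * nu, (1.0 - mean) * nu


def synthetic_scores(labels, g, rng):
    """Draw one score per residue; labels are 'I' or 'N' (hypothetical encoding)."""
    a_i, b_i = beta_params(E_I, VAR0_I * g)
    a_n, b_n = beta_params(E_N, VAR0_N * g)
    return [rng.betavariate(a_i, b_i) if y == "I" else rng.betavariate(a_n, b_n)
            for y in labels]


print(beta_params(E_I, VAR0_I), beta_params(E_N, VAR0_N))
print(synthetic_scores(["I", "N", "N", "I"], g=1.0, rng=random.Random(0)))
```

A β-distribution is unimodal with an interior mode when both parameters exceed 1, which can be checked on the returned pairs for each value of $g$.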



The Surface AUC Ratios of the enhancer compared 
with the threshold predictor on PlaneDimers and the KL- 
subset are displayed in Table 1. There is an improvement 
of 8.4% - 9.3% on PlaneDimers and of 3.2% - 5.0% on the 
KL-subset. 

Moreover, we compared individual classification results 
obtained by thresholding the scores with pCRF-based 
enhanced predictions. Because of the fact that the speci- 
ficity of the threshold predictor can be easily changed 
by manipulating the threshold, we proceeded as follows. 
For every score precision, the pCRF-based enhancer has 
a well-defined specificity referred to the surface residues. 
We then chose the threshold such that the specificity of 
the threshold predictor is close to that of the enhancer. 
The results are shown in Table 2. The sensitivity is
increased by 53% - 67% on the data set PlaneDimers and
by 14% - 22% on the KL-subset.
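
The specificity matching used for this comparison can be sketched as follows. This is an illustration under stated assumptions: residues are predicted as interface when their score exceeds the threshold, labels are encoded as 1 (interface) and 0 (non-interface), and candidate thresholds are restricted to the observed scores. None of these details are specified in the text, and the function names are hypothetical.

```python
def specificity(scores, labels, threshold):
    """Fraction of non-interface residues (label 0) with score <= threshold."""
    neg = [s for s, y in zip(scores, labels) if y == 0]
    return sum(s <= threshold for s in neg) / len(neg)


def sensitivity(scores, labels, threshold):
    """Fraction of interface residues (label 1) with score > threshold."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    return sum(s > threshold for s in pos) / len(pos)


def match_specificity(scores, labels, target):
    """Observed score whose use as threshold gives specificity closest to target."""
    candidates = sorted(set(scores))
    return min(candidates,
               key=lambda t: abs(specificity(scores, labels, t) - target))


# Toy data, invented for illustration only.
scores = [0.1, 0.2, 0.3, 0.4, 0.8, 0.9]
labels = [0, 0, 0, 0, 1, 1]
t = match_specificity(scores, labels, target=0.75)
print(t, specificity(scores, labels, t), sensitivity(scores, labels, t))
```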

Table 1 and Table 2 justify the following conclusion. 
Enhancing the threshold prediction by our pCRF works 
provided that the distributions of the interface scores 
as well as the non-interface scores are unimodal. The 
enhancement for the data set PlaneDimers is larger than
for the KL-subset. This might be caused by the flat
interface geometry of the complexes taken from PlaneDimers.

Utilizing the new data set due to Cukuroglu [40]

As in the case of the KL-subset, we randomly chose 60
dimers. We refer to the resulting list as CGNK-subset.
Having assigned synthetic scores according to Equations 7, 9
and 10, where $g = 1.0$, we compared individual
classification results obtained by thresholding the scores with
pCRF-based enhanced predictions in exactly the same
way as we did for the KL-subset. The results are shown in
Table 3. The sensitivity is increased by 22%.

A main finding of Cukuroglu [40] relevant to protein-protein
interface prediction is that the average interface
RASA value is greater than 40%. Since our method is
designed to improve performance of a given residue-wise 
predictor, using this result is not in the scope of this paper. 
However, a CRF-based predictor integrating features for 
cliques of size greater than 2 is not beyond the range of 
current algorithmic capabilities. In such a model a fea- 
ture set that discretizes the mean RASA value of cliques is 
promising. 

Enhancing the PresCont server prediction on PlaneDimers 

For the sake of completeness, we briefly review the
residue characteristics used by PresCont.

Relative solvent-accessible surface area

For any residue $a$, the solvent-accessible surface area
$\mathrm{asa}(a)$ can be computed by e.g. the software library
BALL [42]. Most of the classifiers known from the literature
utilize this characteristic (see [43]). For PresCont the
relative solvent-accessible surface area according to

$$\mathrm{rasa}(a) := \frac{\mathrm{asa}(a)}{\mathrm{asa}_{\max}(a)} \qquad (11)$$

is used, where $\mathrm{asa}_{\max}(a)$ is the maximally
possible accessible surface area of residue $a$ [44].

Hydrophobicity 

Many interfaces possess a hydrophobic core surrounded 
by a ring of polar residues [45,46]. In order to reduce noise, 
in [25] the contribution of hydrophobic patches rather
than the influence of individual residues is utilized.

Residue conservation 

Measures of this type utilized in [25] are the Shannon 
entropy and the relative Shannon entropy of empirical 
residue distributions in MSA columns. As an alternative, 
empirical expectations of BLOSUM-based similarities are 
taken for them. 
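
For illustration, the Shannon entropy of the empirical residue distribution in a single MSA column can be computed as follows. This is a generic sketch and not PresCont's implementation: the handling of gap characters (ignored here) and the use of base-2 logarithms are assumptions, and the relative entropy and BLOSUM-based variants are not shown.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy in bits of the residues in one MSA column.

    column: string with one character per aligned sequence; the gap
    character '-' is ignored (an assumption of this sketch).
    """
    residues = [c for c in column if c != "-"]
    counts = Counter(residues)
    n = len(residues)
    return -sum((k / n) * math.log2(k / n) for k in counts.values())


print(column_entropy("AAAA"), column_entropy("AALL"), column_entropy("AL-V"))
```

Low entropy indicates a conserved column; a fully conserved column has entropy zero.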



Table 1 Classification results on PlaneDimers and the KL-subset, where the β-distributions according to which the synthetic scores were drawn are defined by Equations 7, 8, 9 and 10

PlaneDimers
Score precision g      0.8     0.9     1.0     1.1     1.2
Surface AUC ratio Γ    1.084   1.091   1.093   1.089   1.093

KL-subset
Signal precision g     0.8     0.9     1.0     1.1     1.2
Surface AUC ratio Γ    1.032   1.039   1.045   1.045   1.050

Depending on the variances determined by g, the enhancer increases the AUC referred to the protein surface by 8.4%-9.3% on PlaneDimers, and by 3.2%-5.0% on the KL-subset.



Dong et al. BMC Bioinformatics 2014, 15:277
http://www.biomedcentral.com/1471-2105/15/277






Table 2 Comparing the enhancer with the threshold classifier of approximately equal specificity on synthetic scores assigned to surface residues of protein complexes taken from the data set PlaneDimers and the KL-subset

Data set      Score precision g   Classifier            Specificity   Sensitivity   MCC
PlaneDimers   0.8                 Threshold predictor   0.9672        0.2562        0.3253
PlaneDimers   0.8                 Enhancer              0.9666        0.4281        0.4911
PlaneDimers   0.9                 Threshold predictor   0.9618        0.2556        0.3077
PlaneDimers   0.9                 Enhancer              0.9624        0.4086        0.4610
PlaneDimers   1.0                 Threshold predictor   0.9611        0.2428        0.2912
PlaneDimers   1.0                 Enhancer              0.9612        0.3872        0.4379
PlaneDimers   1.1                 Threshold predictor   0.9681        0.2100        0.2753
PlaneDimers   1.1                 Enhancer              0.9677        0.3307        0.4045
PlaneDimers   1.2                 Threshold predictor   0.9649        0.2100        0.2648
PlaneDimers   1.2                 Enhancer              0.9647        0.3213        0.3854
KL-subset     0.8                 Threshold predictor   0.9568        0.2936        0.3549
KL-subset     0.8                 Enhancer              0.9577        0.3586        0.4210
KL-subset     0.9                 Threshold predictor   0.9533        0.2843        0.3369
KL-subset     0.9                 Enhancer              0.9531        0.3290        0.3820
KL-subset     1.0                 Threshold predictor   0.9570        0.2559        0.3152
KL-subset     1.0                 Enhancer              0.9571        0.2971        0.3591
KL-subset     1.1                 Threshold predictor   0.9615        0.2279        0.2949
KL-subset     1.1                 Enhancer              0.9614        0.2743        0.3459
KL-subset     1.2                 Threshold predictor   0.9604        0.2199        0.2828
KL-subset     1.2                 Enhancer              0.9599        0.2516        0.3175



Scores of local neighborhoods

They are evaluated by means of log-odds ratios of neighboring
residue pair frequencies in interfaces as opposed to
residue pair frequencies on complementary protein surface
areas. The resulting scores are averaged both over the
neighborhood of the positions under study and the rows
of the MSA associated with the protein.

On the basis of Figure 5 we enhanced PresCont for
thresholds $\theta \in [0.500, 0.625]$. The decisive factor for this
choice is that the PresCont score distributions for interface
sites as well as non-interface positions above $\theta$ are
"sufficiently close to" unimodal distributions. For every
such $\theta$, we set all scores less than or equal to $\theta$ to zero
and then left the classification of all surface residues to the
pCRF modified as follows. The residues of score zero are
not taken into account when it comes to discretizing the
protein characteristics (see Equations 4 and 5). Let us call
this enhancing above $\theta$.

Table 3 Comparing the enhancer with the threshold classifier of approximately equal specificity on synthetic scores assigned to surface residues of protein complexes taken from the CGNK-subset

Classifier            Specificity   Sensitivity   MCC
Threshold predictor   0.9399        0.3782        0.3387
Enhancer              0.9400        0.3104        0.2767



To evaluate improvements we proceeded as when compiling
Table 2. For every threshold $\theta$ under consideration
another threshold $\theta'$ was chosen such that thresholding
at $\theta'$ has the same specificity as enhancing above $\theta$.
The results are displayed in Table 4 and visualized for an
individual protein in Figure 6. According to Table 4 the
increase in sensitivity ranges from 4% to 7%. The true-positive
predictions on the surface of the protein with
PDB-Entry 1QM4 are compared in Figure 6, where again
the specificity of the two classifiers is the same.
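
The performance measures reported in Table 4 follow from the four confusion counts by the standard definitions. The sketch below recomputes them for the pair of rows belonging to the threshold 0.500; the counts are taken from Table 4, while the function name `metrics` is chosen here for illustration.

```python
import math


def metrics(tp, tn, fp, fn):
    """Return (specificity, sensitivity, MCC) from confusion counts."""
    spec = tn / (tn + fp)
    sens = tp / (tp + fn)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return spec, sens, mcc


# Counts from Table 4, rows "Enhancing above 0.500" and the matching "PresCont".
enhancer = metrics(tp=2181, tn=23182, fp=4145, fn=1414)
prescont = metrics(tp=2100, tn=23197, fp=4130, fn=1495)
print([round(v, 3) for v in enhancer])
print([round(v, 3) for v in prescont])
print("relative sensitivity gain:", enhancer[1] / prescont[1] - 1.0)
```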

Discussing posterior decoding 

As in the case of linear-chain CRFs, the generalized Viterbi 
algorithm can be transformed into a variant form of the 
forward algorithm. It might be the case that the following 
additional problem arises. 

Let $v_1, v_2, \ldots, v_n$ be the ordering in which the positions
of $\mathcal{G}$ are traversed by the algorithm, and let
$\mathcal{I}$ denote the set of position indices $i < n$ such that
$v_i$ is not an element of the boundary $\mathcal{B}^{(i)}$ of the
history set $\mathcal{H}^{(i)}$ at stage $i$. If $\mathcal{I}$ is not
empty, we encounter an obstacle when it comes to sampling label
sequences. For $i \in \mathcal{I}$, position $v_i$ is not labeled in
the course of the sampling procedure. That is why we augment the
neighborhood graph $\mathcal{G}$ so that those positions no longer
exist, all predictions remain unchanged, and the order of
magnitude of the running time is not increased. To this end, we
complement the ordering $v_1, v_2, \ldots, v_n$ as follows. For
every $i \in \mathcal{I}$, we insert a new node $v_i'$ between
$v_i$ and $v_{i+1}$. The neighborhood graph is thus extended by
these nodes, which are not associated with residue positions of
the protein under study, and by new edges $\{v_i, v_i'\}$
($i \in \mathcal{I}$), where for $i \in \mathcal{I}$ and
$y_0, y_1, y_2 \in \{N, I\}$ we set
$\Phi_{v_i'}(y_0, x) = \Phi_{\{v_i, v_i'\}}(y_1, y_2, x) = 0$. This
eliminates the above-mentioned obstacle without any influence on
the prediction or on the order of magnitude of the running time.

Table 4 Enhancing above various thresholds on PlaneDimers, where PresCont's threshold was chosen such that the specificity approximately equals that of enhancing

                        tp     tn      fp     fn     Spec.   Sen.    MCC
Enhancing above 0.500   2181   23182   4145   1414   0.848   0.607   0.362
PresCont                2100   23197   4130   1495   0.849   0.584   0.346
Enhancing above 0.525   2303   22917   4410   1292   0.839   0.641   0.373
PresCont                2206   22912   4415   1389   0.838   0.614   0.353
Enhancing above 0.550   2507   22103   5224   1088   0.809   0.697   0.375
PresCont                2419   22102   5225   1176   0.809   0.673   0.358
Enhancing above 0.575   2560   21992   5335   1035   0.805   0.712   0.380
PresCont                2463   21915   5412   1132   0.802   0.685   0.358
Enhancing above 0.600   2379   22685   4642   1216   0.830   0.662   0.376
PresCont                2253   22780   4547   1342   0.834   0.627   0.356
Enhancing above 0.625   2287   23044   4283   1308   0.843   0.636   0.376
PresCont                2136   23049   4278   1459   0.843   0.594   0.346

The sensitivity increased that way by 4%-7%. For every pair of experiments, the number of true negatives (tn), false negatives (fn), false positives (fp) and true positives (tp) are displayed.

Proceeding now in a way analogous to the classical case,
in every formula that is a building block of the generalized
Viterbi algorithm the following two steps of replacement
need to be performed.

First, for every position $i \in V$, every edge
$\{i, j\} \in \mathcal{E}$, every label $y_0 \in \{I, N\}$, and every
label pair $(y_1, y_2) \in \{I, N\}^2$, we replace $\Phi_i(y_0, x)$
with $\exp(\Phi_i(y_0, x))$, and $\Phi_{\{i,j\}}(y_1, y_2, x)$ with
$\exp(\Phi_{\{i,j\}}(y_1, y_2, x))$.

Second, we replace sums with products and then maxima
with sums.

Thus we obtain as analogues of the Viterbi variables
$\mathrm{vit}_j^{(i)}(y_{\mathcal{B}_j^{(i)}})$ defined by Equation 6
what we call component forward variables
$\mathrm{cf}_j^{(i)}(y_{\mathcal{B}_j^{(i)}})$.





Figure 6 Comparison of enhancer and PresCont service of same specificity on the protein with PDB-Entry 1QM4. (A) Green spheres on the
left show the interface surface residues correctly predicted by both tools. (B) Red spheres on the right indicate additional true positives of the
enhancer.

If $\mathcal{H}_1^{(i)}, \mathcal{H}_2^{(i)}, \ldots, \mathcal{H}_{m_i}^{(i)}$
and $\mathcal{B}_1^{(i)}, \mathcal{B}_2^{(i)}, \ldots, \mathcal{B}_{m_i}^{(i)}$
are the connected components of the history set
$\mathcal{H}^{(i)}$ and the corresponding boundary set
$\mathcal{B}^{(i)}$ at stage $i \in \{1, 2, \ldots, n\}$,
respectively, then the forward variable at stage $i$ with
respect to a boundary assignment $y_{\mathcal{B}^{(i)}}$ is
defined as

$$f_i\bigl(y_{\mathcal{B}^{(i)}}\bigr) := \prod_{j=1}^{m_i} \mathrm{cf}_j^{(i)}\bigl(y_{\mathcal{B}_j^{(i)}}\bigr).$$

For any assignment $y_{\mathcal{B}^{(i)}}$ ($i > 1$), the forward
variable $f_i(y_{\mathcal{B}^{(i)}})$ is a nontrivial linear
combination of forward variables
$f_{i-1}(y_{\mathcal{B}^{(i-1)}})$, where $y_{\mathcal{B}^{(i-1)}}$
ranges over some assignments of the boundary set
$\mathcal{B}^{(i-1)}$ at stage $i - 1$. Analogous to the
linear-chain case, a random backward walk through a state graph,
with all possible assignments $y_{\mathcal{B}^{(i)}}$
($i = n - 1, \ldots, 1$) being the set of nodes, results in a
random labeling of the positions, where each labeling is drawn
with its posterior probability.

This sampling technique allows the efficient calcula- 
tion of posterior probabilities at nodes and edges in a 
straightforward manner. 
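
To make the analogy concrete, the following sketch shows the forward recursion and the random backward walk for the linear-chain special case only. It is not the generalized algorithm for pCRFs, in which the forward variables are indexed by boundary assignments rather than by single labels. The data layout, the function names and the toy potentials are assumptions made for this illustration.

```python
import math
import random

LABELS = ("N", "I")  # non-interface, interface


def forward(node_pot, edge_pot):
    """Forward variables of a linear chain.

    node_pot[t][y]       log-potential of label y at position t
    edge_pot[t][(a, b)]  log-potential of labels (a, b) at positions t, t+1
    Potentials are exponentiated, and the maximum of the Viterbi
    recursion is replaced by a sum.
    """
    f = [{y: math.exp(node_pot[0][y]) for y in LABELS}]
    for t in range(1, len(node_pot)):
        f.append({
            y: math.exp(node_pot[t][y]) * sum(
                f[t - 1][a] * math.exp(edge_pot[t - 1][(a, y)]) for a in LABELS
            )
            for y in LABELS
        })
    return f


def sample_labeling(node_pot, edge_pot, rng):
    """Draw one labeling with its posterior probability by a backward walk."""
    f = forward(node_pot, edge_pot)
    n = len(f)
    labels = [None] * n
    labels[-1] = rng.choices(LABELS, weights=[f[-1][y] for y in LABELS])[0]
    for t in range(n - 2, -1, -1):
        weights = [f[t][a] * math.exp(edge_pot[t][(a, labels[t + 1])])
                   for a in LABELS]
        labels[t] = rng.choices(LABELS, weights=weights)[0]
    return labels


# Toy chain of three positions with hypothetical log-potentials.
node_pot = [{"N": 0.0, "I": 0.5}, {"N": 0.2, "I": 0.0}, {"N": 0.0, "I": 1.0}]
edge_pot = [{(a, b): (0.8 if a == b else -0.3) for a in LABELS for b in LABELS}
            for _ in range(2)]
partition = sum(forward(node_pot, edge_pot)[-1].values())
print(partition, sample_labeling(node_pot, edge_pot, random.Random(0)))
```

Repeating the backward walk and counting label frequencies at a position or on an edge gives Monte Carlo estimates of the corresponding posterior probabilities.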

Conclusions 

Residue-wise score-based threshold predictors of protein- 
protein interaction sites assign to each residue of the pro- 
tein under study a score. The classification is then made by 
thresholding the score. In case of using probabilistic data 
models, the parameters of the threshold predictor have 
been learned on a training data set in advance. 

We have demonstrated that such threshold predictors
can be improved by pCRF-based enhancers provided that the
shapes of the interface surface score distribution and the
non-interface surface score distribution with respect to
the training set resemble unimodal distributions. Besides
the surface residue scores, only the spatial neighborhood
structure between the surface residues of the protein under
study is taken into account. Thus, the improvement can be
attributed to our model. In addition to the precision of the
scores, the amount of improvement depends on the
3D-complexity of the interfaces to be predicted. To this end,
we performed three sets of experiments with synthetic surface
residue scores for protein complexes randomly chosen from the
data set PlaneDimers compiled by Zellner et al. [25] and from
the lists published by Keskin et al. [39] and Cukuroglu et al.
[40].

The enhancement is structurally based on the following
model property of pCRFs in contrast to residue-wise
predictors. Though the scores of nearby residues may be
correlated, labeling a position as interface or non-interface
by thresholding the score does not influence the
classification of its neighbors. With pCRFs, by contrast, the
label of a position does influence the classification of its
neighbors.

The pCRF-based enhancer is also applicable if the
score distributions are only unimodal over a certain
sub-domain. The improvement is then restricted to that
domain. Thus we were able to improve the prediction of
the PresCont server devised by Zellner et al. on
PlaneDimers [25].

The prediction is made on grounds of a generalized 
Viterbi inference heuristic. As for training, we devel- 
oped a piecewise training procedure for pCRFs, where the 
enhancer needs to be trained on data originating from the 
same source as the training data of the threshold predictor 
to be improved. 

A prototypical implementation of our pCRF-based
method is accessible at
http://ppicrf.informatik.uni-goettingen.de/index.html.